The Collatz Conjecture is a mathematical hypothesis named after a German Mathematician, Lothar Collatz, who may have come up with it in 1930s. This hypothesis revolves around a sequence defined by the following rules:
Begin with a positive integer (n).
Generate each subsequent term based on the following conditions:
If the previous term is even, the next term is half of the previous term.
If the previous term is odd, the next term is obtained by multiplying the previous term by 3 and then adding 1.
The conjecture assume that irrespective of the initial positive integer chosen, the sequence will always eventually reach the value 1.
Firstly, create gen_collatz function that takes a
positive integer n as input and generates the Collatz
sequence until it reaches 1.
gen_collatz <- function(n) {
if (!is.integer(n) || n <= 0) {
stop("Input must be a positive integer")
}
seq <- c(n)
while (n != 1) {
if (n %% 2 == 0) {
n <- n / 2
} else {
n <- 3 * n + 1
}
seq <- c(seq, n)
}
return(seq)
}
Next, create collatz_df tibbles that stores the Collatz
sequence for starting integers ranging from 1 to 10,000. This tibble
contains five columns:
start - the starting integer value,
seq - the Collatz sequence saved as a list,
length - the length of the sequence,
parity - even or odd starting integer,
max_val - the maximum value in the
sequence.
library(tidyverse)
start <- 1:10000
seq <- list()
for (i in start) {
collatz_seq <- gen_collatz(i)
seq[[i]] <- collatz_seq
}
length <- double()
for (i in start) {
length[i] <- length(gen_collatz(i))
}
parity <- ifelse(start %% 2 == 0,
"Even",
"Odd")
max_val <- double()
for (i in start) {
max_val[i] <- max(gen_collatz(i))
}
collatz_df <- tibble(start,
seq,
length,
parity,
max_val)
Using {tidyverse} data wrangling techniques, we will
analyze the data to provide essential insights into the behavior of the
Collatz Conjecture sequences. Then, using {ggplot2}, we
will create the appropriate graphs that visualize the data wrangling
tasks.
top10longest <- collatz_df %>%
arrange(desc(length)) %>%
slice_head(n = 10) %>%
pull(start)
Below is the scatter plot of the sequence lengths, with the starting integer on the horizontal axis and the length of the sequence on the vertical axis. The top 10 starting integers are highlighted and labeled in blue.
Below is the code for the plot:
library(ggplot2)
library(ggrepel)
# Scatterplot of all the sequence lengths
plot1<- ggplot( data = collatz_df,
mapping = aes(x = start,
y = length)
)+
geom_point()+
labs(
title = "Scatterplot of Sequence Lengths",
x = "Starting Integer",
y = "Length of Sequence"
)
# To find the top 10 longest starting integers
top10length <- collatz_df %>%
arrange(desc(length)) %>%
top_n(10,length)
# To identify the top 10 longest starting integers in the scatterplot
scatterplot1 <-
plot1 +
geom_point(data = top10length,aes(colour = "Top 10"))+
scale_colour_manual(values = c("Top 10" = "blue"))+
geom_text_repel(data = top10length, aes(label = start))
max_val_int <- collatz_df %>%
slice(which.max(max_val)) %>%
pull(start)
max_val_int = 9663
Below is the scatter plot of the highest value reached by each starting integer, with the starting integers in the horizontal axis, and the maximum values in the vertical axis. The top 10 starting integers are highlighted and labeled in red.
We can see that starting integer 9663 from
max_val_int, indeed has the highest maximum value.
Below is the code for the plot:
# Scatterplot of the highest values of starting integers
plot2<- ggplot( data = collatz_df,
mapping = aes(x = start,
y = max_val)
)+
geom_point()+
labs(
title = "Scatterplot Of Maximum Value Reached In Sequence",
x = "Starting Integer",
y = "Maximum Value"
)
# To find the top 10 starting integers with the highest value
sortedvalue <- collatz_df %>% arrange(desc(max_val))
top_10_value <- sortedvalue[1:10,]
# To highlight the top 10 starting integers
scatterplot2 <-
plot2 +
geom_point(data = top_10_value,
aes(colour = "Top 10"))+
scale_colour_manual(values = c("Top 10" = "red"))+
geom_text_repel( data = top_10_value, aes(label = start))
even_odd_avg_len <- collatz_df %>%
group_by(parity) %>%
summarise(mean(length)) %>%
pull(2)
even_odd_sd_len <- collatz_df %>%
group_by(parity) %>%
summarise(sd(length)) %>%
pull(2)
even_odd_avg_len = 79.5936
92.3396
even_odd_sd_len = 45.10308
47.18387
Shown below are box plots to compare the distribution of sequence lengths for even and odd starting integers, with the parity on the horizontal axis and the length of sequence on the vertical axis.
Lets denote the box plot of even starting integers are “Box plot A”, and the box plot of odd starting integers as “Box plot B”. As we can see above, there are a few notable differences between box plot A and box plot B, which are as follows:
Outliers are only present in Box plot B
The median in Box plot B is higher than in Box plot A, which indicates that odd starting integers result in longer sequences.
Below is the code for the plot:
# Boxplot of distribution of sequence length for even,odd starting integers
ggplot( data = collatz_df,
mapping = aes( x = parity,
y = length))+
geom_boxplot()+
labs(
title = "Boxplots Comparing Distribution of Even and Odd Starting Integers",
x = "Parity",
y = "Length of Sequence"
)
When a series hits a number that is less than the initial integer but subsequently rises beyond it at least once more before reaching 1, backtracking has occurred. Hence, let’s investigate them!
collatz_df to retain starting integers
that exhibit backtracking in their sequence.has_backtracking <- function(seq) {
if (length(seq) <= 2) {
return(FALSE)
}
for (i in 2:(length(seq) - 1)) {
if (seq[i] < seq[1] && seq[i + 1] > seq[1]) {
return(TRUE)
}
}
return(FALSE)
}
backtracks_df <- collatz_df %>%
filter(sapply(seq, has_backtracking))
A function
has_backtrackingwas created to identify and check if the backtracking has occurred with inputseq.
count_backtrack <- function(seq) {
sum(seq > seq[1])
}
mode_backtrack <- backtracks_df %>%
mutate( backtrack_counts = sapply(seq, count_backtrack)) %>%
count(backtrack_counts) %>%
arrange(desc(n)) %>%
pull(backtrack_counts) %>%
first()
Another function
count_backtrackwas introduced to count number of backtracks that has occurred with inputseq.
first_backtrack <- function(seq) {
start <- seq[1]
first <- start
reached_backtrack <- FALSE
for (i in seq) {
if (reached_backtrack && i > first) {
first <- i
}
if (i < start) {
reached_backtrack <- TRUE
}
}
return(first)
}
max_after_backtrack <- backtracks_df %>%
mutate(max_after_backtrack = sapply(seq, first_backtrack)) %>%
pull(max_after_backtrack)
Lastly, function
first_backtrackwas created to find the first backtrack value for the sequences
even_odd_backtrack <- backtracks_df %>%
group_by(parity) %>%
summarise(frequency = n()) %>%
pull(2)
even_odd_backtrack = 3943
4286
Hence, backtracking sequences are more common in odd starting integers compared to even starting integers.
Investigating the correlation between the starting integers and the number of even and odd numbers in the sequence
odd_counts and
even_counts, which are the frequency of of odd and even
numbers in the sequence.odd_numbers_in_seq <- function(x) {
odd_count <- sum(x %% 2 != 0)
return(odd_count)
}
odd_counts <- double()
for (i in start) {
odd_counts[i] <- odd_numbers_in_seq(collatz_df$seq[[i]])
}
even_numbers_in_seq <- function(x) {
even_count <- sum(x %% 2 == 0)
return(even_count)
}
even_counts <- double()
for (i in start) {
even_counts[i] <- even_numbers_in_seq(collatz_df$seq[[i]])
}
cor() to compute the correlation
and store it as correlation_start.correlation_start <- collatz_df %>%
mutate(even_counts, odd_counts) %>%
select(-seq, -parity) %>%
cor()
> correlation_start
start length max_val even_counts odd_counts
start 1.00000000 0.2054051 0.08813284 0.2212319 0.1798567
length 0.20540510 1.0000000 0.17133951 0.9998222 0.9995421
max_val 0.08813284 0.1713395 1.00000000 0.1719367 0.1702539
even_counts 0.22123190 0.9998222 0.17193671 1.0000000 0.9987938
odd_counts 0.17985668 0.9995421 0.17025389 0.9987938 1.0000000
The correlation coefficient between the starting integers and the number of even numbers in the sequence is 0.22123, which is low and positive.
The correlation coefficient between the starting integers and the number of odd numbers in the sequence is 0.17986, which is also low and positive.
The correlation coefficient between the number of even and odd numbers in the sequence is 0.99879, which is high (close to 1) and positive.
The first two plots represents the relationship between the starting integer of the sequence and the number of even and odd numbers in the sequence respectively.
The last plot represents the relationship between the number of odd and even numbers in the sequence.
The code below is used for creating a visualization to show whether there is a relationship between two variables.
start_even_counts <- collatz_df %>%
mutate(even_counts) %>%
select(-seq, -parity) %>%
ggplot(., aes(x = start,
y = even_counts)) +
geom_point(color = "skyblue") +
geom_smooth(method = "lm",
se = FALSE,
fullrange = TRUE,
color = "red") +
labs(x = "Starting integers",
y = "Even numbers in the sequence") +
theme_minimal()
start_odd_counts <- collatz_df %>%
mutate(odd_counts) %>%
select(-seq, -parity) %>%
ggplot(., aes(x = start,
y = odd_counts)) +
geom_point(color = "skyblue") +
geom_smooth(method = "lm",
se = FALSE,
fullrange = TRUE,
color = "red") +
labs(x = "Starting integers",
y = "Odd numbers in the seq") +
theme_minimal()
even_odd_counts <- collatz_df %>%
mutate(odd_counts, even_counts) %>%
select(-seq, -parity) %>%
ggplot(., aes(x = even_counts,
y = odd_counts)) +
geom_point(color = "skyblue") +
geom_smooth(method = "lm",
se = FALSE,
fullrange = TRUE,
color = "red") +
labs(x = "Even numbers",
y = "Odd numbers") +
theme_minimal()
library(ggpubr)
ggarrange(start_even_counts, start_odd_counts, even_odd_counts,
labels = c("p = 0.22123", "p = 0.17986", "p = 0.99879"),
hjust = -2,
font.label = list(size = 9),
ncol = 2, nrow = 2)
ggarrange()is used to fit multiple plots in one image. Before using this, it is important to installggpubrfirst.
For this section we will look into 3 different visualizations for Collatz Conjecture;
Here is a plot of starting integers up to 10,000, with the largest maximum value reached by each starting integer plotted on the y-axis. The y-axis stopped at 100,000, but not all starting integers can be shown at this scale. For example, when n = 9663, the largest value reached climbs as high as 27 million.
Below is the code to this plot:
collatz_df %>%
unnest(seq) %>%
group_by(start) %>%
filter(start %in% 1:10000) %>%
slice_max(order_by = seq) %>%
ggplot(.,
aes(x = start,
y = seq)) +
geom_point(aes(col = start,
alpha = length),
size = 1) +
labs(
title = "Collatz Conjecture",
subtitle = "Max value reached by each starting integer ",
x = "Starting Integer",
y = "Value"
) +
theme_minimal() +
xlim(0, 10000) +
ylim(0, 100000)
Here is a plot of the numerical progression of each starting integer from 1 to 30. Interestingly, the starting integer n = 27, goes through 112 steps to finally reach 1.
Below is the code to this plot:
collatz_df %>%
unnest(seq) %>%
group_by(start) %>%
filter(start %in% 1:30) %>%
mutate(steps = row_number()) %>%
ggplot(.,
aes(x = steps,
y = seq)) +
geom_line(aes(col = length)) +
facet_wrap(start ~ length, scales = "free") +
labs(
title = "Collatz Sequence Line Plot",
subtitle = "Numerical Progression for each starting integer",
x = "Steps",
y = "Value"
) +
theme_classic()
You may replace the starting integer filter “1:30” to any numbers you want to look at their numerical progression.
Here is a hexagonal plot of the length of each starting integers from 1 to 10,000. For every hexagon, you can check how many data points there are which leads to the count. As you can see, lengths around 50 and 130 are more common compared to the rest.
Perhaps, by having more starting integers (i.e 1 to 100,000), a better hexagonal representation can be made to see which lengths are more common.
Below is the code to this plot:
collatz_df %>%
ggplot(.,
aes(x = start,
y = length)) +
geom_hex(bins = 20) +
scale_y_continuous(breaks = seq(0, 275, by = 25)) +
scale_fill_viridis_c() +
theme_minimal()
| Tasks | Done by |
|---|---|
| Task 1 | @HafizNjame |
| Task 2 | @HafeezulRaziq |
| Task 3 | @20B2097 |
| Task 4 | @dnshzm |
| Task 5 | @HafizNjame |
| Task 6 | @HafeezulRaziq |
| README | @HafizNjame @dnshzm @20B2097 @HafeezulRaziq |